Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias
As one popular modeling approach to end-to-end speech recognition,
attention-based encoder-decoder models are known to suffer from length bias and
the corresponding beam problem. Various heuristics have been applied within
simple beam search to ease the problem, most requiring considerable tuning. We
show that such heuristics are not a proper modeling refinement and result in
severe performance degradation when the beam size is greatly increased. We
propose a novel beam search derived from
reinterpreting the sequence posterior with an explicit length modeling. By
applying the reinterpreted probability together with beam pruning, the obtained
final probability leads to a robust model modification, which allows reliable
comparison among output sequences of different lengths. Experimental
verification on the LibriSpeech corpus shows that the proposed approach solves
the length bias problem without heuristics or additional tuning effort. It
provides robust decision making and consistently good performance under both
small and very large beam sizes. Compared with the best results of the
heuristic baseline, the proposed approach achieves the same WER on the 'clean'
sets and a 4% relative improvement on the 'other' sets. We also show that it is
more efficient with the additionally derived early-stopping criterion.
Comment: accepted at INTERSPEECH202
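The length-bias problem and its correction by an explicit length model can be illustrated with a toy scoring sketch. The probabilities and the `length_log_prob` term below are hypothetical illustrations, not the paper's actual reinterpreted posterior:

```python
import math

def beam_score(log_probs, length_log_prob=None):
    """Score a hypothesis as the sum of its token log-probabilities,
    optionally adding an explicit length-model term (hypothetical)."""
    score = sum(log_probs)
    if length_log_prob is not None:
        score += length_log_prob
    return score

# The pure sequence posterior favors the shorter hypothesis (length bias):
short = [-0.5, -0.5]                  # 2 tokens, total -1.0
long_seq = [-0.4, -0.4, -0.4, -0.4]   # 4 tokens, total -1.6
assert beam_score(short) > beam_score(long_seq)

# With an explicit length model assigning higher probability to the
# plausible longer length, the comparison across lengths is corrected:
len_model = {2: math.log(0.1), 4: math.log(0.9)}
assert beam_score(long_seq, len_model[4]) > beam_score(short, len_model[2])
```

The point of the sketch is only that an explicit length term makes hypotheses of different lengths directly comparable, which is what the proposed search exploits.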
Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that well
configured Transformer models outperform our baseline models based on the
shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. Positional encoding is an essential
augmentation for the self-attention mechanism, which is otherwise invariant to
sequence ordering. However, in an autoregressive setup, as is the case in
language modeling, the amount of information increases along the position
dimension, which is itself a positional signal. The analysis of attention weights
shows that deep autoregressive self-attention models can automatically make use
of such positional information. We find that removing the positional encoding
even slightly improves the performance of these models.
Comment: To appear in the proceedings of INTERSPEECH 201
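The observation that the causal (autoregressive) mask itself carries positional information can be sketched with a minimal single-head attention over toy 1-D embeddings; this is an illustrative sketch, not the paper's Transformer:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_self_attention(x):
    """Single-head dot-product self-attention with a causal mask and NO
    positional encoding (toy scalar embeddings, a sketch only)."""
    out = []
    for t in range(len(x)):
        scores = [x[t] * x[j] for j in range(t + 1)]  # attend to positions <= t
        weights = softmax(scores)
        out.append(sum(w * x[j] for j, w in enumerate(weights)))
    return out

# The same token value yields different outputs at different positions,
# because the causal prefix grows with t -- a positional signal by itself.
x = [1.0, 2.0, 1.0]
y = causal_self_attention(x)
assert y[0] != y[2]  # token 1.0 at positions 0 and 2 sees different prefixes
```

Even without any positional encoding, position 2 attends over a larger prefix than position 0, so the model's representation is position-dependent, matching the paper's attention-weight analysis.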
Context-Dependent Acoustic Modeling without Explicit Phone Clustering
Phoneme-based acoustic modeling for large-vocabulary automatic speech
recognition takes advantage of phoneme context. The large number of
context-dependent (CD) phonemes and their highly varying statistics require
tying or smoothing to enable robust training. Usually, Classification and
Regression Trees are used for phonetic clustering, which is standard in Hidden
Markov Model (HMM)-based systems. However, this solution introduces a secondary
training objective and does not allow for end-to-end training. In this work, we
address direct phonetic context modeling for the hybrid Deep Neural Network
(DNN)/HMM approach, which does not rely on any phone-clustering algorithm to
determine the HMM state inventory. By performing different
decompositions of the joint probability of the center phoneme state and its
left and right contexts, we obtain a factorized network consisting of different
components, trained jointly. Moreover, the representation of the phonetic
context for the network relies on phoneme embeddings. On the Switchboard task,
the recognition accuracy of our proposed models is comparable to, and slightly
better than, that of the hybrid model using standard state-tying decision trees.
Comment: Submitted to Interspeech 202
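One possible chain-rule decomposition of the joint probability of the center phoneme state with its contexts can be sketched with toy factor tables; the tables below are hypothetical stand-ins for the jointly trained network components, which in the paper operate on phoneme embeddings:

```python
import math

# One decomposition of the joint probability of center state c with
# left/right phoneme contexts l, r:
#     p(l, c, r) = p(c | l, r) * p(r | l) * p(l)
# Hypothetical toy tables; each factor would be a trained network component.
p_left = {"a": 0.6, "b": 0.4}
p_right_given_left = {"a": {"x": 0.7, "y": 0.3},
                      "b": {"x": 0.5, "y": 0.5}}
p_center_given_ctx = {("a", "x"): {"s1": 0.5, "s2": 0.5},
                      ("a", "y"): {"s1": 0.9, "s2": 0.1},
                      ("b", "x"): {"s1": 0.2, "s2": 0.8},
                      ("b", "y"): {"s1": 0.6, "s2": 0.4}}

def joint_log_prob(l, c, r):
    """Sum the log-probabilities of the factorized components."""
    return (math.log(p_center_given_ctx[(l, r)][c])
            + math.log(p_right_given_left[l][r])
            + math.log(p_left[l]))

# p(l=a, c=s1, r=x) = 0.5 * 0.7 * 0.6 = 0.21
assert abs(math.exp(joint_log_prob("a", "s1", "x")) - 0.21) < 1e-12
```

Because the factors multiply back to the joint probability, the components can be trained jointly under a single objective, avoiding the secondary objective introduced by phonetic decision trees.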
Improved training of end-to-end attention models for speech recognition
Sequence-to-sequence attention-based models on subword units allow simple
open-vocabulary end-to-end speech recognition. In this work, we show that such
models can achieve competitive results on the Switchboard 300h and LibriSpeech
1000h tasks. In particular, we report the state-of-the-art word error rates
(WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets
of LibriSpeech. We introduce a new pretraining scheme by starting with a high
time reduction factor and lowering it during training, which is crucial both
for convergence and final performance. In some experiments, we also use an
auxiliary CTC loss function to aid convergence. In addition, we train long
short-term memory (LSTM) language models on subword units. By shallow fusion,
we report up to 27% relative improvements in WER over the attention baseline
without a language model.
Comment: submitted to Interspeech 201
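Shallow fusion, as used above, amounts to a per-step log-linear combination of the attention decoder's distribution and the LM's distribution over subword units. The weight `lam` and the toy probabilities below are hypothetical tuning values, not the paper's:

```python
import math

def shallow_fusion_step(am_log_probs, lm_log_probs, lam=0.3):
    """Combine attention-model and LM scores per subword unit:
    score(y) = log p_am(y) + lam * log p_lm(y); return the best unit."""
    scores = {y: am_log_probs[y] + lam * lm_log_probs[y] for y in am_log_probs}
    return max(scores, key=scores.get)

# Toy distributions over two subword candidates (hypothetical numbers):
am = {"unit_a": math.log(0.55), "unit_b": math.log(0.45)}
lm = {"unit_a": math.log(0.10), "unit_b": math.log(0.90)}

# The LM term flips the decision toward the linguistically likelier unit:
assert shallow_fusion_step(am, lm) == "unit_b"
```

In practice the combined score is applied inside beam search at every decoding step rather than to a single greedy choice as in this sketch.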
Material and Spiritual Culture of the Armenians of Karabakh: Problems of the Development and Preservation of the National Cultural Heritage in the 1920s-1990s
The article examines questions of the material and spiritual culture of the Armenians of Karabakh, as well as the problems of the development and preservation of their national cultural heritage in the 1920s-1990s.
On a Method for Checking the Operability of a Shift Register
The paper describes the functional diagram of a device for checking the operability of a shift register, based on the method of measuring the time it takes a one to shift through the register. The device uses two two-input coincidence circuits, a delay line - …
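The timing-based check can be illustrated with a minimal software simulation; the hardware details (the two coincidence circuits and the delay line) are abstracted away here, and the `faulty_stage` parameter is a hypothetical fault model, so this is only a sketch of the underlying idea:

```python
def shift_register_operable(n_stages, faulty_stage=None):
    """Inject a single 1 into an n-stage shift register and count the
    clock cycles until it reaches the output; a healthy register takes
    exactly n_stages shifts. faulty_stage models a stuck-at-0 cell."""
    reg = [0] * n_stages
    bit_in = 1
    for cycle in range(1, 2 * n_stages + 1):
        reg = [bit_in] + reg[:-1]        # one shift per clock pulse
        if faulty_stage is not None:
            reg[faulty_stage] = 0        # the stuck cell drops the bit
        bit_in = 0
        if reg[-1] == 1:
            return cycle == n_stages     # transit time matches expectation
    return False                         # the 1 never arrived: faulty

assert shift_register_operable(8) is True
assert shift_register_operable(8, faulty_stage=3) is False
```

The hardware device realizes the same comparison in analog form: the delay line encodes the expected transit time, and the coincidence circuits check that the output pulse arrives in the expected clock window.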